CISC 520-50-B - 2023/Summer - Data Engineering and Mining

Deliverable 3: Final Project Report

Saurabh Shirish Prabhu

SPrabhu1@my.harrisburgu.edu

Dataset used:

1. National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave I, 1994-1995 and
https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/11900

2. National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave IV, 2008
https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/11920

The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a school-based longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. Data have been collected from adolescents, their fellow students, school administrators, parents, siblings, friends, and romantic partners through multiple data collection components, including four respondent in-home interviews. In addition, existing databases with information about respondents’ neighborhoods and communities have been merged with Add Health data, including variables on income and poverty, unemployment, availability and utilization of health services, crime, church membership, and social programs and policies.

The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32*. Add Health combines longitudinal survey data on respondents’ social, economic, psychological and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships, providing unique opportunities to study how social environments and behaviors in adolescence are linked to health and achievement outcomes in young adulthood. The fourth wave of interviews expanded the collection of biological data in Add Health to understand the social, behavioral, and biological linkages in health trajectories as the Add Health cohort ages through adulthood.

This study spans 14 years, from 1994 through the most recent wave in 2008.

Wave I

The public use dataset for Wave I contains information collected in 1994-95 from Add Health’s nationally representative sample of adolescents. This dataset includes Wave I respondents and consists of one-half of the core sample, chosen at random, and one-half of the oversample of African-American adolescents with a parent who has a college degree. The total number of Wave I respondents in this dataset is approximately 6,500.

The Wave I public use dataset includes information from each of the following sources (as available):

1. In-School Questionnaire
2. Wave I In-Home Interview
3. Add Health Picture Vocabulary Test (AHPVT), an abbreviated version of the Peabody Picture Vocabulary Test—Revised, with age-standardized scores for adolescent respondents
4. Wave I Parent Questionnaire
5. Contextual data
6. In-school network data
7. Weights

Wave IV

Wave IV was designed to study developmental and health trajectories across the life course from adolescence into young adulthood. Taking place in 2008, approximately 92.5% of the original Wave I respondents were located and 80.3% of eligible cases were interviewed. The Wave IV public use file contains data on 5,114 respondents, aged 24 to 32*. In Wave IV, biological data was also gathered to acquire a greater understanding of predisease pathways, with a specific focus on obesity, stress, and health risk behavior.

The Wave IV public use dataset includes the following data files:

1. Wave IV In-home Interview File: variables from the in-home interview, including anthropometric measures
2. Relationship Data
3. Pregnancy Table File
4. Live Births File
5. Children and Parenting File
6. Wave IV Weights
7. Wave IV Public Use Biomarkers, Glucose Data
8. Wave IV Public Use Biomarkers, Measures of EBV and hsCRP
9. Wave IV Public Use Biomarkers, Lipids Data

Description of data quality


The dataset combines longitudinal survey data on respondents’ social, economic, psychological and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships. The dataset has been followed into young adulthood with four in-home interviews, the most recent in 2008.


The quality of the data is high, and Add Health is considered one of the most comprehensive datasets on adolescent health and development. The dataset is also publicly available for researchers to use. However, there are some known sources of error and bias. The sample is not representative of all adolescents in the United States because it excludes those who dropped out of school before grade 7 or who were not enrolled in school during the 1994-95 school year. Add Health also oversampled schools with larger proportions of black and Hispanic students, and the reliance on self-reported data may introduce measurement error.

The dataset is maintained at the Odum Institute Data Archive.

The Odum Institute Data Archive is a research data stewardship organization that provides long-term preservation and stewardship of research data assets to broaden scientific inquiry, promote research reproducibility, and foster data fluency. The archive is home to one of the largest catalogs of social science research data in the U.S., including the Harris Polls, North Carolina Vital Statistics, and the most complete collection of 1970s U.S. Census data. The institute offers services for data management plan development and implementation, finding and accessing data, data management training and education, and data curation for reproducibility training. UNC Dataverse is a web-based data repository that enables scientists, research teams, scholarly journals, and other members of the UNC research community to archive and share their own datasets.

Mining Methods

There are several data mining techniques used to extract useful information from large datasets. Some of the most popular include:

Association analysis: This technique is used to find association rules showing attribute-value conditions that occur frequently together in a given set of data. It is widely used for market basket or transaction data analysis.
Classification: This technique is used to classify data into predefined classes or categories based on their attributes.
Prediction: This technique is used to predict future trends or values based on historical data.
Clustering: This technique is used to group similar data points together based on their attributes.
Regression: This technique is used to establish a relationship between two or more variables in a dataset.
Artificial Neural Network (ANN) Classifier Method: This technique is used to classify data into predefined classes or categories based on their attributes using an artificial neural network.
Outlier Detection: This technique is used to identify unusual patterns or observations in a dataset.
Genetic Algorithm: This technique is used to optimize complex problems by simulating the process of natural selection.
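As a concrete illustration of two of the techniques listed above (clustering and classification), the sketch below runs on synthetic data with scikit-learn. It is a hedged illustration only: none of the values or names come from the Add Health datasets.

```python
# Minimal sketch: clustering (KMeans) and classification (logistic
# regression) on two well-separated synthetic groups of points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two groups of 2-D points, centered at 0 and 6.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Clustering: group similar points without using labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: predict group membership from the same features.
y = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, y)
acc = clf.score(X, y)
print(acc)  # groups are well separated, so accuracy is very high
```

The same pattern (fit, then predict/score) applies when the synthetic `X` is replaced by real survey features.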

References of data mining techniques:
https://www.geeksforgeeks.org/data-mining-techniques/
https://www.investopedia.com/terms/d/datamining.asp
https://www.ibm.com/topics/data-mining
https://www.javatpoint.com/data-processing-in-data-mining
https://www.springboard.com/blog/data-science/data-mining/

Hypothesis

How do features related to educational resources, health resources, and other opportunities influence participants' highest education level?

Mining Methods and Analysis Proposal

In this study, the following methods were proposed:

Regression analysis is proposed to uncover patterns in adolescent-to-adult health. Logistic regression is proposed to model the relationship between a binary target variable and various features.
Classification is proposed to understand yes/no patterns in participants' education levels.
Prediction is proposed to check whether the classification and regression models can predict participants' education levels.
Feature Selection: Feature selection techniques help in choosing the most relevant and informative features for building models, reducing complexity and improving model performance.
Ensemble Methods: Ensemble methods combine multiple models to improve prediction accuracy and reduce overfitting. Bagging (e.g., Random Forest) and Boosting (e.g., Gradient Boosting Machines) are common ensemble techniques.
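The proposed combination of feature selection followed by an ensemble classifier can be sketched as below on synthetic data. This is an assumption-laden illustration: the column names and target are made up for the example and are not Add Health variables.

```python
# Sketch of the proposed pipeline: univariate feature selection, then a
# Random Forest (a bagging ensemble). Synthetic, illustrative data only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "tv_hours":    rng.integers(0, 40, n),   # hypothetical features
    "sleep_hours": rng.integers(4, 11, n),
    "exercise":    rng.integers(0, 2, n),
    "noise":       rng.normal(size=n),       # irrelevant feature
})
# Synthetic binary target loosely tied to the first three features.
y = ((df["sleep_hours"] + 2 * df["exercise"] - 0.2 * df["tv_hours"]) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)

# Feature selection: keep the k most informative columns by F-score.
selector = SelectKBest(f_classif, k=3).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Ensemble method: bagging of decision trees (Random Forest).
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_sel, y_train)
acc = model.score(X_test_sel, y_test)
print(acc)
```

With the real data, the recoded Wave I/Wave IV features would take the place of the synthetic columns and H4ED2 would be the target.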

In [1]:
#import required packages
import numpy as np
import pandas as pd

Wave I dataset:

In [2]:
df_1994 = pd.read_sas('F:/a_Harrisburg_University_Academics/CISC 520-50-O-2023Summer- Big Data Mining/Term Project/01 Dataset selection/Datasets/dataverse_files_wave1-1994/w1inhome.sas7bdat')
df_1994
Warning: column count mismatch (290 + 343 != 2794)

E:\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py:800: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  rslt[name] = self._byte_chunk[jb, :].view(dtype=self.byte_order + "d")
E:\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py:809: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  rslt[name] = self._string_chunk[js, :]
Out[2]:
AID IMONTH IDAY IYEAR SCH_YR BIO_SEX VERSION SMP01 SMP03 H1GI1M ... PD4A PD4B PD4C PD4D PD4E PD4F PD5 PD5A AH_PVT AH_RAW
0 b'57100270' 6.0 23.0 95.0 1.0 2.0 b'4' 0.0 1.0 10.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 1.0 1.0 86.0 55.0
1 b'57101310' 5.0 5.0 95.0 1.0 2.0 b'1' 1.0 0.0 11.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 88.0 58.0
2 b'57103171' 6.0 27.0 95.0 0.0 1.0 b'4' 1.0 0.0 10.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 1.0 0.0 120.0 79.0
3 b'57103869' 7.0 14.0 95.0 0.0 1.0 b'4' 1.0 0.0 1.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 NaN NaN 85.0 56.0
4 b'57104553' 7.0 14.0 95.0 1.0 2.0 b'4' 1.0 0.0 6.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 1.0 0.0 90.0 59.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6499 b'99719930' 7.0 18.0 95.0 0.0 2.0 b'5' 1.0 0.0 4.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 1.0 0.0 94.0 57.0
6500 b'99719939' 5.0 3.0 95.0 1.0 1.0 b'1' 1.0 1.0 6.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 1.0 0.0 NaN NaN
6501 b'99719970' 7.0 11.0 95.0 0.0 1.0 b'4' 1.0 0.0 4.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 1.0 0.0 91.0 56.0
6502 b'99719976' 6.0 26.0 95.0 0.0 2.0 b'4' 1.0 0.0 12.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 1.0 0.0 82.0 50.0
6503 b'99719978' 6.0 4.0 95.0 1.0 1.0 b'3' 1.0 0.0 9.0 ... 7.0 7.0 7.0 7.0 7.0 7.0 1.0 0.0 102.0 62.0

6504 rows × 2794 columns

Wave I dataset analytics

In [3]:
df_1994.shape
Out[3]:
(6504, 2794)
In [4]:
df_1994.describe()
Out[4]:
IMONTH IDAY IYEAR SCH_YR BIO_SEX SMP01 SMP03 H1GI1M H1GI1Y H1GI2 ... PD4A PD4B PD4C PD4D PD4E PD4F PD5 PD5A AH_PVT AH_RAW
count 6504.000000 6504.000000 6504.000000 6504.000000 6504.000000 6504.000000 6504.000000 6504.000000 6504.000000 6504.000000 ... 5655.000000 5655.000000 5655.000000 5655.000000 5655.000000 5655.000000 5066.000000 5227.000000 6223.000000 6223.000000
mean 6.577645 15.735855 94.999846 0.354397 1.516759 0.933579 0.079951 6.604090 78.970941 0.542589 ... 6.926260 6.932626 6.943943 6.926083 6.918126 6.909814 0.975326 0.103118 100.624297 64.790133
std 1.403560 8.708040 0.012400 0.483165 0.502825 0.249035 0.271238 3.905455 1.809704 0.575287 ... 0.565212 0.565381 0.527495 0.601716 0.630985 0.675249 0.155146 0.304143 15.079536 11.088681
min 1.000000 1.000000 94.000000 0.000000 1.000000 0.000000 0.000000 1.000000 74.000000 0.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 14.000000 0.000000
25% 5.000000 8.000000 95.000000 0.000000 1.000000 1.000000 0.000000 4.000000 78.000000 0.000000 ... 7.000000 7.000000 7.000000 7.000000 7.000000 7.000000 1.000000 0.000000 91.000000 58.000000
50% 6.000000 15.000000 95.000000 0.000000 2.000000 1.000000 0.000000 7.000000 79.000000 1.000000 ... 7.000000 7.000000 7.000000 7.000000 7.000000 7.000000 1.000000 0.000000 101.000000 65.000000
75% 8.000000 23.000000 95.000000 1.000000 2.000000 1.000000 0.000000 9.000000 80.000000 1.000000 ... 7.000000 7.000000 7.000000 7.000000 7.000000 7.000000 1.000000 0.000000 112.000000 74.000000
max 12.000000 31.000000 95.000000 6.000000 6.000000 1.000000 1.000000 96.000000 96.000000 8.000000 ... 8.000000 9.000000 9.000000 9.000000 8.000000 8.000000 1.000000 1.000000 139.000000 87.000000

8 rows × 2791 columns

Wave IV dataset:

In [5]:
df_2008 = pd.read_sas('F:/a_Harrisburg_University_Academics/CISC 520-50-O-2023Summer- Big Data Mining/Term Project/01 Dataset selection/Datasets/dataverse_files/w4inhome.sas7bdat')


df_2008 = pd.DataFrame(data=df_2008)

df_2008
Warning: column count mismatch (176 + 255 != 920)

E:\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py:800: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  rslt[name] = self._byte_chunk[jb, :].view(dtype=self.byte_order + "d")
E:\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py:809: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  rslt[name] = self._string_chunk[js, :]
Out[5]:
AID IMONTH4 IDAY4 IYEAR4 BIO_SEX4 VERSION4 BREAK_Q PRYEAR4 PRETEST4 PRISON4 ... H4EO5C H4EO5D H4EO5E H4EO5F H4EO5G H4EO5H H4EO5I H4EO5J H4EO6 H4EO7
0 b'57101310' 5.0 6.0 2008.0 2.0 b'V5.4' b'NO' 2001.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0
1 b'57103869' 5.0 22.0 2008.0 1.0 b'V5.4' b'NO' 2002.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 1.0
2 b'57109625' 11.0 2.0 2008.0 1.0 b'V5.5' b'NO' 2002.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 b'57111071' 6.0 29.0 2008.0 1.0 b'V5.4' b'NO' 2001.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0
4 b'57113943' 11.0 11.0 2008.0 1.0 b'V5.5' b'NO' 2002.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5109 b'99719930' 6.0 7.0 2008.0 2.0 b'V5.4' b'NO' 2001.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 4.0 2.0
5110 b'99719939' 2.0 13.0 2008.0 1.0 b'V5.1' b'NO' 2001.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 2.0
5111 b'99719970' 3.0 22.0 2008.0 1.0 b'V5.2' b'NO' 1996.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 4.0 3.0
5112 b'99719976' 4.0 1.0 2008.0 2.0 b'V5.2' b'NO' 2001.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 2.0
5113 b'99719978' 4.0 16.0 2008.0 1.0 b'V5.3' b'NO' 1995.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 2.0

5114 rows × 920 columns

Wave IV dataset analytics

In [6]:
df_2008.shape
Out[6]:
(5114, 920)
In [7]:
df_2008.describe()
Out[7]:
IMONTH4 IDAY4 IYEAR4 BIO_SEX4 PRYEAR4 PRETEST4 PRISON4 H4OD1M H4OD1Y H4OD2A ... H4EO5C H4EO5D H4EO5E H4EO5F H4EO5G H4EO5H H4EO5I H4EO5J H4EO6 H4EO7
count 5114.000000 5114.000000 5114.000000 5114.000000 5114.000000 5114.000000 5114.000000 5114.000000 5114.000000 5114.000000 ... 4673.000000 4673.000000 4673.000000 4673.000000 4673.000000 4673.000000 4673.000000 4673.000000 4674.000000 4674.000000
mean 5.222135 15.591318 2008.000196 1.539890 2000.261830 0.008995 0.005866 6.602073 1978.995894 0.997458 ... 0.012198 0.015836 0.010914 0.027819 0.038733 0.036379 0.017334 0.010272 3.301883 1.395593
std 2.746748 8.887742 0.134866 0.498455 2.189283 0.094423 0.076374 3.413174 1.775208 0.050359 ... 0.256885 0.263679 0.254431 0.284587 0.301959 0.298331 0.266412 0.253193 3.535308 0.659350
min 1.000000 1.000000 2007.000000 1.000000 1995.000000 0.000000 0.000000 1.000000 1974.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000
25% 3.000000 8.000000 2008.000000 1.000000 2001.000000 0.000000 0.000000 4.000000 1978.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 1.000000
50% 5.000000 15.000000 2008.000000 2.000000 2001.000000 0.000000 0.000000 7.000000 1979.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 1.000000
75% 7.000000 23.000000 2008.000000 2.000000 2001.000000 0.000000 0.000000 10.000000 1980.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 2.000000
max 12.000000 31.000000 2009.000000 2.000000 2002.000000 1.000000 1.000000 12.000000 1983.000000 1.000000 ... 7.000000 7.000000 7.000000 7.000000 7.000000 7.000000 7.000000 7.000000 97.000000 7.000000

8 rows × 915 columns

Conclusion and final remarks on deliverable 01 Dataset selection

The final dataset will be a combination of the Wave I and Wave IV datasets. The aim is to understand factors leading to adolescent-to-adult success: how behavior patterns, traits, environment, parenting, etc. affect success and a healthy life.


Deliverable 2: Data Exploration and Mining Methods Proposal

In [8]:
#!pip install cufflinks plotly
In [9]:
#!pip install chart-studio 
#!pip install plotly jupyterlab --user
In [10]:
# importing some plotly modules

import chart_studio
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

# importing cufflinks and setting it to enable using plotly offline

import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=False) # connected = True for using online mode. 


import plotly.express as px

# conventional plotting libraries

import matplotlib.pyplot as plt
import seaborn as sns


## reference : plotly errors : https://stackoverflow.com/questions/52771328/plotly-chart-not-showing-in-jupyter-notebook
In [11]:
#display all column names of DataFrame df_1994 for reference
# print(df_1994.columns.tolist())

Correlation matrix and correlation plot

Since the dataset is very large and computation is expensive, we plot only a limited number of columns.
Here df_corr is a correlation matrix for the first n columns.

In [12]:
import copy
df_1994_copy = copy.deepcopy(df_1994)
In [13]:
# df_corr = df_1994_copy.iloc[:, :30].corr() # Generate correlation matrix of first 30 columns ..


# # fig = go.Figure()
# # layout = {"title": "Add Title in plotly"}
# # fig.add_trace(
# #     go.Heatmap(
# #         x = df_corr.columns,
# #         y = df_corr.index,
# #         z = np.array(df_corr)
# #         )
# # )


# fig = go.Figure(
#     data=go.Heatmap(
#         x = df_corr.columns,
#         y = df_corr.index,
#         z = np.array(df_corr)
#         ), 
#     layout=go.Layout(
#         title="Correlation matrix of n columns",
#         xaxis=dict(title='corr matrix columns'),
#         yaxis=dict(title='corr matrix indexes')
#     ),
# )
# fig.show()

# ## reference : 
# ## https://en.ai-research-collection.com/plotly-heatmap/
# ## https://stackoverflow.com/questions/59059378/title-for-colorbar-in-plotly-heatmap

Histograms

Histogram of participants household work (help) using plotly library

In [14]:
# df_1994['H1DA1'].iplot(kind='hist', xTitle='work', yTitle='count', title='how many times did the participant do work around the house')

Histogram of participants weight using conventional and plotly library

H1GH60 - what is your weight?

range 50 to 360 pounds
996 = refused
998 = don't know
999 = not applicable

In [15]:
## histogram of participant weight (H1GH60)
%matplotlib inline 
df_1994["H1GH60"].plot(kind="hist")
Out[15]:
<AxesSubplot:ylabel='Frequency'>
In [16]:
df_1994_weight = df_1994[(df_1994['H1GH60']>=0) & (df_1994['H1GH60']<=900)] ## removing 996, 998, 999 
#df_1994_weight["H1GH60"].plot(kind="hist")


df_1994_weight['H1GH60'].iplot(kind='hist', xTitle='Weight pounds', yTitle='count', title='Participants weight in pounds')

How many hours did the participant watch TV?

H1DA8 is a quantitative variable, but 996 and 998 are coded as refused and don't know, so those values are removed.

H1DA8 : How many hours a week do you watch television?
0 = does not watch TV
1 to 99 = range 1 to 99 hours
996 = refused
998 = don't know

In [17]:
df_1994_TV = df_1994[(df_1994['H1DA8']>=0) & (df_1994['H1DA8']<=900)]  


df_1994_TV['H1DA8'].iplot(kind='hist', xTitle='number of hrs watched TV', yTitle='count', title='How many hours did the participant watch TV?')
In [18]:
#!pip install --upgrade seaborn

Preliminary research question

Does watching TV affect participants' grades in mathematics?

Variables under investigation

H1ED12 : what was your grade in mathematics?
1 = A
2 = B
3 = C
4 = D or lower
5 = didn't take math
6 = different grading pattern
96 = refused
97 = skipped
98 = don't know

H1DA8 : How many hours a week do you watch television?
0 = does not watch TV
1 to 99 = range 1 to 99 hours
996 = refused
998 = don't know

BIO_SEX: Interviewer, please confirm that R’s sex is (male) female
1 = male
2 = female
6 = refused

In [19]:
sns.barplot(x="H1ED12", y="H1DA8", data=df_1994, ci=None)
plt.xlabel('Grade: 1 = A, 2 = B, 3 = C, 4 = D or lower, 5 = didnt take math, 6 = different grading pattern, 96 = refused, 97 = skipped, 98 = dont know ')
plt.ylabel('TV watching hours hrs/week')
E:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning:



The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.


Out[19]:
Text(0, 0.5, 'TV watching hours hrs/week')

Data transformation:

Converting grades B or less into one category and A into another category for analysis

In [20]:
df_1994_copy = df_1994_copy[(df_1994_copy['H1DA8']>=0) & (df_1994_copy['H1DA8']<=99)]
df_1994_copy.loc[df_1994_copy['H1ED12'] != 1 , 'H1ED12'] = 0



sns.barplot(x="H1ED12", y="H1DA8", data=df_1994_copy, errorbar=('ci', True))
plt.xlabel('Grade:\n 0 = B or less \n  1 = A')
plt.ylabel('TV watching hours hrs/week')
Out[20]:
Text(0, 0.5, 'TV watching hours hrs/week')

How does gender affect TV watching hours per week vs grades?

In [21]:
fig3 = px.box(
    data_frame = df_1994_copy
    ,y = 'H1DA8'
    ,x = 'H1ED12'
    ,color = 'BIO_SEX'
    ,labels={
                     "H1DA8": "TV watching hours hrs/week",
                     "H1ED12": "Grade:1 = A ; \n 0 = B or less \n  ",
                     "BIO_SEX": "Gender <br> 1 = male <br> 2 = female"}
              )
 
fig3.show()



#df_1994_copy.iplot(x='H1ED12', y='H1DA8', categories='BIO_SEX', xTitle='Grade:\n 0 = B or less \n  1 = A', yTitle='TV watching hours hrs/week', title='Time spend watching TV vs Grades')

Preliminary Observation

Students with an A grade in math tend to spend fewer hours watching TV.

Merging two datasets

In [22]:
df_merged = pd.merge(df_1994, df_2008, on="AID", how="left")
df_merged.shape
Out[22]:
(6504, 3713)
In [23]:
2794+920-1 ## Number of columns = df_1994 + df_2008 and number of rows = df_1994
Out[23]:
3713
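The column arithmetic above follows from how a left merge works: all rows of the left frame are kept, and the shared key column appears only once, so the merged frame has left_cols + right_cols - 1 columns. A tiny self-contained demo (toy frames, not Add Health data):

```python
# Left-merge behaviour in miniature: 3 left rows are all kept, and the
# shared key "AID" is not duplicated in the result.
import pandas as pd

left = pd.DataFrame({"AID": [1, 2, 3], "a": [10, 20, 30]})
right = pd.DataFrame({"AID": [1, 3], "b": [100, 300]})

merged = pd.merge(left, right, on="AID", how="left")
print(merged.shape)  # (3, 3): 3 left rows; 2 + 2 - 1 columns
```

Rows missing from the right frame (here AID = 2) get NaN in the right-side columns, which is why many Wave IV columns are NaN for Wave I respondents not interviewed in 2008.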

Create a dataset with features under investigation

To understand depression, some of the following features were investigated:

  1. BMI
  2. Age = 2008 - birth year

Calculating BMI for childhood year 1994 and adulthood year 2008
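The cells below apply the standard BMI formulas; as a quick reference, they can be written as small helper functions (the function names here are illustrative, not part of the notebook):

```python
# BMI formulas used in the cells below, as standalone helpers.
def bmi_imperial(weight_lb: float, height_in: float) -> float:
    """BMI from pounds and inches, using the conversion factor 703."""
    return 703 * weight_lb / height_in ** 2

def bmi_metric(weight_kg: float, height_cm: float) -> float:
    """BMI from kilograms and centimetres (height converted to metres)."""
    return weight_kg / (height_cm / 100) ** 2

# The two formulas agree for equivalent units, e.g. 150 lb / 68 in
# is about 68 kg / 172.7 cm; both give a BMI near 22.8.
print(bmi_imperial(150, 68))
print(bmi_metric(68.0, 172.7))
```

Wave I stores self-reported height in feet and inches (H1GH59A/H1GH59B) and weight in pounds (H1GH60), hence the imperial form; Wave IV measures height in cm and weight in kg (H4HGT/H4WGT), hence the metric form.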

In [24]:
df_bmi_94 = df_merged[(df_merged['H1GH60']>=0) & (df_merged['H1GH60']<=900)][['AID', 'H1GH60','H1GH59A','H1GH59B']]
df_bmi_94 = df_bmi_94[(df_bmi_94['H1GH59B']>=0) & (df_bmi_94['H1GH59B']<=90)]
df_bmi_94 = df_bmi_94[(df_bmi_94['H1GH59A']>=0) & (df_bmi_94['H1GH59A']<=90)]
df_bmi_94['HEIGHT94'] = df_bmi_94['H1GH59A']*12 + df_bmi_94['H1GH59B']
df_bmi_94['BMI94'] = 703*df_bmi_94['H1GH60'] / (df_bmi_94['HEIGHT94'])**2
df_bmi_94 = df_bmi_94[['AID', 'H1GH60','HEIGHT94','BMI94']]
df_bmi_94['H1GH60'] = df_bmi_94['H1GH60']*0.45359237 # convert weight to kg
df_bmi_94['HEIGHT94'] = df_bmi_94['HEIGHT94']*2.54
df_bmi_94
Out[24]:
AID H1GH60 HEIGHT94 BMI94
0 b'57100270' 99.790321 157.48 40.234131
1 b'57101310' 68.946040 182.88 20.612654
3 b'57103869' 102.058283 205.74 24.108368
4 b'57104553' 90.718474 170.18 31.321007
5 b'57104649' 54.431084 162.56 20.595703
... ... ... ... ...
6499 b'99719930' 50.348753 165.10 18.469349
6500 b'99719939' 65.770894 175.26 21.410418
6501 b'99719970' 63.502932 165.10 23.294675
6502 b'99719976' 73.481964 165.10 26.955266
6503 b'99719978' 61.234970 165.10 22.462722

6291 rows × 4 columns

In [25]:
df_bmi_08 = df_merged[(df_merged['H4WGT']>=0) & (df_merged['H4WGT']<=880)][['AID', 'H4WGT','H4HGT']]
df_bmi_08 = df_bmi_08[(df_bmi_08['H4HGT']>=0) & (df_bmi_08['H4HGT']<=900)]
df_bmi_08['BMI08'] = df_bmi_08['H4WGT'] / (df_bmi_08['H4HGT']/100)**2
df_bmi_08
Out[25]:
AID H4WGT H4HGT BMI08
1 b'57101310' 113.9 180.0 35.154321
3 b'57103869' 107.8 202.0 26.418979
7 b'57109625' 68.0 161.0 26.233556
9 b'57111071' 89.4 177.0 28.535861
11 b'57113943' 150.6 185.5 43.766029
... ... ... ... ...
6499 b'99719930' 58.6 167.5 20.886612
6500 b'99719939' 97.0 174.5 31.855239
6501 b'99719970' 80.6 178.0 25.438707
6502 b'99719976' 75.3 165.0 27.658402
6503 b'99719978' 78.2 180.0 24.135802

5042 rows × 4 columns

In [26]:
df_bmi = pd.merge(df_bmi_94, df_bmi_08, on="AID", how="left").rename(
    columns={'H4WGT':'WEIGHT_08','H4HGT':'HEIGHT_08','BMI08':'BMI_08',
            'H1GH60':'WEIGHT_94','HEIGHT94':'HEIGHT_94','BMI94':'BMI_94'})
In [27]:
# Data Visualization
num_feats = df_bmi.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(3, 3, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()
In [28]:
# BIO_SEX


df_sex = df_merged[['AID','BIO_SEX']]
df_sex
Out[28]:
AID BIO_SEX
0 b'57100270' 2.0
1 b'57101310' 2.0
2 b'57103171' 1.0
3 b'57103869' 1.0
4 b'57104553' 2.0
... ... ...
6499 b'99719930' 2.0
6500 b'99719939' 1.0
6501 b'99719970' 1.0
6502 b'99719976' 2.0
6503 b'99719978' 1.0

6504 rows × 2 columns

In [29]:
df_bmi_sex = pd.merge(df_bmi, df_sex, on="AID", how="left")
In [30]:
# Box plots of numerical variables against BIO_SEX
numerical_features = df_bmi_sex.select_dtypes(include=[np.number])

# Number of rows and columns in the numerical features DataFrame
num_rows, num_cols = numerical_features.shape

# Number of subplots required (excluding BIO_SEX)
num_plots = num_cols - 1

# Setting the size of the box plot grid
plt.figure(figsize=(12, 8))

# Loop through each numerical independent variable and create the box plot with hue as 'HeartDisease'
for i, feature in enumerate(numerical_features.columns[:-1]):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(data=df_bmi_sex, x='BIO_SEX', y=feature, palette='coolwarm')
    plt.title(f"{feature} vs sex")
    plt.xlabel('sex 1:male, 2:Female')
    plt.ylabel(feature)

plt.tight_layout()
plt.show()
In [31]:
# Delta BMI

df_deltaBMI = df_bmi_sex
df_deltaBMI['deltaBMI'] = df_bmi_sex['BMI_08']-df_bmi_sex['BMI_94']

df_deltaBMI = df_deltaBMI[['AID','BIO_SEX','BMI_08','deltaBMI']]
df_deltaBMI



num_feats = df_deltaBMI.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(3, 3, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()


## Some participants show a reduction in BMI from 1994 to 2008
In [32]:
df_merged = pd.merge(df_merged,df_deltaBMI[['AID','BMI_08','deltaBMI']], on="AID", how="left")  #df_deltaBMI[['AID','deltaBMI']]
df_merged.shape
Out[32]:
(6504, 3715)
In [33]:
df_merged['BMI_08']
Out[33]:
0             NaN
1       35.154321
2             NaN
3       26.418979
4             NaN
          ...    
6499    20.886612
6500    31.855239
6501    25.438707
6502    27.658402
6503    24.135802
Name: BMI_08, Length: 6504, dtype: float64

Filter a few features for prediction of participants' education level

Selecting the following features from the datasets:

Feature | Description | Type
AID | Participant unique identifier | Object
BIO_SEX | Gender: 1 = R is male, 2 = R is female, 6 = refused | Categorical
H4OD1Y | Respondent's date of birth – year: 1974-1983 | Numerical
H1DA6 | During the past week, how many times did you exercise? 0 = not at all, 1 = 1 or 2 times, 2 = 3 or 4 times, 3 = 5 or more times, 6 = refused, 8 = don't know | Categorical
H4DA1 | How many hours did you watch television? 1-150 hours, 996 = refused, 998 = don't know | Numerical
H4SP1H | What time do you usually wake up? 1-12 hours, 96 = refused, 98 = don't know | Numerical
H4PE7 | I'm always optimistic about my future: 1 = strongly agree, 2 = agree, 3 = neither agree nor disagree, 4 = disagree, 5 = strongly disagree, 6 = refused, 8 = don't know, . = missing | Categorical
H4GH1 | How is your health? 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor | Categorical
H1ED14 | Grade in science: 1 = A, 2 = B, 3 = C, 4 = D or lower, 5 = didn't take this subject, 6 = took the subject but it wasn't graded this way, 96 = refused, 97 = legitimate skip, 98 = don't know | Categorical
H1ED13 | Grade in history or social studies: 1 = A, 2 = B, 3 = C, 4 = D or lower, 5 = didn't take this subject, 6 = took the subject but it wasn't graded this way, 96 = refused, 97 = legitimate skip, 98 = don't know | Categorical
H1ED12 | Grade in mathematics: 1 = A, 2 = B, 3 = C, 4 = D or lower, 5 = didn't take this subject, 6 = took the subject but it wasn't graded this way, 96 = refused, 97 = legitimate skip, 98 = don't know | Categorical
H1GH51 | How many hours of sleep do you usually get? 1-20 hours, 96 = refused, 98 = don't know | Numerical
H1DA8 | How many hours a week do you watch television? 0 hrs, 1-99 hrs, 996 = refused, 998 = don't know | Numerical
H1GH1 | How is your health? 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor, 6 = refused, 8 = don't know | Categorical
H1DA10 | How many hours a week do you play video or computer games? 0 = don't play, 1-99 hrs, 996 = refused, 998 = don't know | Numerical
BMI_08 | BMI in 2008, calculated in the steps above | Numerical
deltaBMI | Change in BMI from 1994 to 2008, as calculated in the steps above | Numerical
H4ED2 (TARGET) | Highest level of education: 1 = 8th grade or less, 2 = some high school, 3 = high school graduate, 4 = some vocational/technical training (after high school), 5 = completed vocational/technical training (after high school), 6 = some college, 7 = completed college (bachelor's degree), 8 = some graduate school, 9 = completed a master's degree, 10 = some graduate training beyond a master's degree, 11 = completed a doctoral degree, 12 = some post baccalaureate professional education (e.g., law school), 13 = completed post baccalaureate professional education (e.g., law school, med school, nurse), 98 = don't know | Categorical

These features were selected somewhat arbitrarily, guided by related and unrelated work in peer-reviewed references.

In [34]:
df_study = df_merged[(df_merged['BIO_SEX']>=0) & (df_merged['BIO_SEX']<=5)][['AID','H4OD1Y','BIO_SEX','H1DA6','H4DA1','H4SP1H', 'H4ED2','H4PE7','H4GH1','H1ED14','H1ED13','H1ED12','BMI_08','H1GH51','H1DA8','H1GH1','H1DA10','deltaBMI']]
df_study['BIO_SEX'] = df_study['BIO_SEX'].astype('int')

# H1DA8 (Wave 1) number of hours spent in watching television per week
df_study = df_study[(df_study['H1DA8']>=0) & (df_study['H1DA8']<995)] # remove skip and dont know

# H1DA6 (Wave 1) past week, how many times did you do exercise? recode to yes / no
df_study = df_study[(df_study['H1DA6']>=0) & (df_study['H1DA6']<6)] # remove skip and dont know
df_study.loc[df_study['H1DA6'] <1 , 'H1DA6'] = 0  # did not exercise
df_study.loc[df_study['H1DA6'] >=1 , 'H1DA6'] = 1 # exercised
df_study['H1DA6'] = df_study['H1DA6'].astype('int')



# H1DA10 (Wave 1) How many hours a week do you play video or computer games? recode to yes / no
df_study = df_study[(df_study['H1DA10']>=0) & (df_study['H1DA10']<995)] # remove skip and dont know
df_study.loc[df_study['H1DA10'] <=1 , 'H1DA10'] = 0  # played at most 1 hour a week (treated as non-player)
df_study.loc[df_study['H1DA10'] >1 , 'H1DA10'] = 1 # played more than 1 hour a week
df_study['H1DA10'] = df_study['H1DA10'].astype('int')



# H1GH1 (Wave 1) In general, how is your health?
df_study = df_study[(df_study['H1GH1']>=0) & (df_study['H1GH1']<6)] # recode health status, remove skip and dont know
df_study.loc[df_study['H1GH1'] <=3 , 'H1GH1'] = 1  # good
df_study.loc[df_study['H1GH1'] >3 , 'H1GH1'] = 0 # bad
df_study['H1GH1'] = df_study['H1GH1'].astype('int')


# Studies and grades
# H1ED12 (Wave 1)  what was your grade in mathematics? 
df_study = df_study[(df_study['H1ED12']>=0) & (df_study['H1ED12']<5)] # recode grade status, remove skip and dont know
df_study.loc[df_study['H1ED12'] <=1 , 'H1ED12'] = 1  # A
df_study.loc[df_study['H1ED12'] >1 , 'H1ED12'] = 0 # B or less
df_study['H1ED12'] = df_study['H1ED12'].astype('int')


# H1ED14 (wave 1) what was your grade in science?
df_study = df_study[(df_study['H1ED14']>=0) & (df_study['H1ED14']<5)] # recode grade status, remove skip and dont know
df_study.loc[df_study['H1ED14'] <=1 , 'H1ED14'] = 1  # A
df_study.loc[df_study['H1ED14'] >1 , 'H1ED14'] = 0 # B or less
df_study['H1ED14'] = df_study['H1ED14'].astype('int')



# H1ED13 (wave 1) what was your grade in history or social studies?
df_study = df_study[(df_study['H1ED13']>=0) & (df_study['H1ED13']<5)] # recode grade status, remove skip and dont know
df_study.loc[df_study['H1ED13'] <=1 , 'H1ED13'] = 1  # A
df_study.loc[df_study['H1ED13'] >1 , 'H1ED13'] = 0 # B or less
df_study['H1ED13'] = df_study['H1ED13'].astype('int')




#Health

# H4OD1Y (Wave 4) Birth year
df_study = df_study[(df_study['H4OD1Y']>=0) & (df_study['H4OD1Y']<1983)] # remove skip and dont know
df_study['H4OD1Y'] = 2008 - df_study['H4OD1Y'] # convert birth year to age at the Wave 4 interview

# H1GH51 How many hours of sleep do you usually get? 
df_study = df_study[(df_study['H1GH51']>=0) & (df_study['H1GH51']<86)] # keep valid sleep hours, remove skip and don't know

# H4SP1H (Wave 4) On the days you go to work, school or similar activities, what time do you usually wake up? [Hour] 
df_study = df_study[(df_study['H4SP1H']>=0) & (df_study['H4SP1H']<86)] # keep valid wake-up hours, remove skip and don't know

# H4GH1 (wave 4) In general, how is your health? 
df_study = df_study[(df_study['H4GH1']>=0) & (df_study['H4GH1']<6)] # recode health status, remove skip and don't know
df_study.loc[df_study['H4GH1'] <=3 , 'H4GH1'] = 1  # great/good
df_study.loc[df_study['H4GH1'] >3 , 'H4GH1'] = 0 # poor
df_study['H4GH1'] = df_study['H4GH1'].astype('int')

# H4PE7  (wave 4) I'm always optimistic about my future
df_study = df_study[(df_study['H4PE7']>=0) & (df_study['H4PE7']<6)] # recode optimism, remove skip and don't know
df_study.loc[df_study['H4PE7'] <3 , 'H4PE7'] = 1  # agree
df_study.loc[df_study['H4PE7'] >=3 , 'H4PE7'] = 0 # disagrees or neither
df_study['H4PE7'] = df_study['H4PE7'].astype('int')



# H4DA1 In the past seven days, how many hours did you watch television? 
df_study = df_study[(df_study['H4DA1']>=0) & (df_study['H4DA1']<995)] # remove skip and don't know


#target variable > highest level of education

# H4ED2 (Wave 4) highest level of education recode education status <target categorical
df_study = df_study[(df_study['H4ED2']>=0) & (df_study['H4ED2']<97)] # recode education status, remove skip and dont know
df_study.loc[df_study['H4ED2'] <=6 , 'H4ED2'] = 0  # some college or less
df_study.loc[df_study['H4ED2'] >6 , 'H4ED2'] = 1 # bachelor's degree or higher
df_study['H4ED2'] = df_study['H4ED2'].astype('int')






num_feats = df_study.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(5, 4, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()
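The cells above repeat one filter-then-binarize pattern for each survey item. That pattern can be factored into a small helper; this is a sketch, and the function name `recode_binary` and its signature are illustrative, not part of the original notebook:

```python
import pandas as pd

def recode_binary(df, col, valid_max, threshold, high_is_one=True):
    """Drop refused/don't-know codes and binarize a survey item.

    Rows with codes outside [0, valid_max) are removed; remaining values
    become 1 if they exceed `threshold` (or 0/1 flipped when high_is_one
    is False). Illustrative helper, not part of the original notebook.
    """
    out = df[(df[col] >= 0) & (df[col] < valid_max)].copy()
    if high_is_one:
        out[col] = (out[col] > threshold).astype(int)
    else:
        out[col] = (out[col] <= threshold).astype(int)
    return out

# e.g. exercise frequency: 996/998 dropped, any activity recoded to 1
demo = pd.DataFrame({'H1DA6': [0, 2, 5, 996, 998]})
demo = recode_binary(demo, 'H1DA6', valid_max=6, threshold=0)
print(demo['H1DA6'].tolist())  # -> [0, 1, 1]
```

Centralizing the recode also keeps the "remove skip and don't know" cutoffs in one place, so a changed codebook value only needs one edit.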
In [35]:
# df_study.dtypes
In [36]:
df_study.shape
Out[36]:
(3871, 18)
In [37]:
df_study.isna().sum()
Out[37]:
AID           0
H4OD1Y        0
BIO_SEX       0
H1DA6         0
H4DA1         0
H4SP1H        0
H4ED2         0
H4PE7         0
H4GH1         0
H1ED14        0
H1ED13        0
H1ED12        0
BMI_08      141
H1GH51        0
H1DA8         0
H1GH1         0
H1DA10        0
deltaBMI    141
dtype: int64
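BMI_08 and deltaBMI are missing for 141 respondents and those rows are dropped below. As a hedged alternative sketch (not what this analysis does), median imputation would retain those respondents at the cost of shrinking the BMI variance slightly:

```python
import pandas as pd

# demo stands in for the BMI_08 column; NaN marks the 141 missing values
demo = pd.DataFrame({'BMI_08': [21.0, None, 27.0, None, 24.0]})

# fill missing BMI values with the column median instead of dropping rows
demo['BMI_08'] = demo['BMI_08'].fillna(demo['BMI_08'].median())
print(demo['BMI_08'].tolist())  # -> [21.0, 24.0, 27.0, 24.0, 24.0]
```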
In [38]:
data = df_study.copy()
corr = data.corr()
print(corr)

sns.heatmap(corr.round(2) , annot=True)
            H4OD1Y   BIO_SEX     H1DA6     H4DA1    H4SP1H     H4ED2  \
H4OD1Y    1.000000 -0.064017 -0.076857 -0.010816 -0.075286  0.005431   
BIO_SEX  -0.064017  1.000000  0.096120 -0.035256 -0.016997  0.104466   
H1DA6    -0.076857  0.096120  1.000000 -0.001467  0.023884  0.049517   
H4DA1    -0.010816 -0.035256 -0.001467  1.000000  0.077603 -0.098553   
H4SP1H   -0.075286 -0.016997  0.023884  0.077603  1.000000 -0.001475   
H4ED2     0.005431  0.104466  0.049517 -0.098553 -0.001475  1.000000   
H4PE7     0.013435 -0.027010 -0.001155 -0.041053 -0.007256  0.043312   
H4GH1    -0.006891 -0.050464  0.008057 -0.077772 -0.044851  0.138556   
H1ED14   -0.081633  0.090587  0.045425 -0.052742 -0.014347  0.312227   
H1ED13   -0.033495  0.086900  0.040592 -0.059191 -0.021354  0.320582   
H1ED12   -0.076282  0.054925  0.023956 -0.048839 -0.008450  0.247443   
BMI_08    0.052950  0.019614 -0.009867  0.087714 -0.015468 -0.136422   
H1GH51   -0.244623 -0.040788  0.018209  0.000619  0.004515 -0.034505   
H1DA8    -0.106669 -0.056995 -0.032492  0.174549  0.036632 -0.103852   
H1GH1    -0.003649 -0.054540  0.077017 -0.015919  0.005841  0.103365   
H1DA10   -0.123981 -0.266654 -0.000909  0.076226  0.031966 -0.044745   
deltaBMI -0.094812  0.034812 -0.005084  0.079991  0.002367 -0.110930   

             H4PE7     H4GH1    H1ED14    H1ED13    H1ED12    BMI_08  \
H4OD1Y    0.013435 -0.006891 -0.081633 -0.033495 -0.076282  0.052950   
BIO_SEX  -0.027010 -0.050464  0.090587  0.086900  0.054925  0.019614   
H1DA6    -0.001155  0.008057  0.045425  0.040592  0.023956 -0.009867   
H4DA1    -0.041053 -0.077772 -0.052742 -0.059191 -0.048839  0.087714   
H4SP1H   -0.007256 -0.044851 -0.014347 -0.021354 -0.008450 -0.015468   
H4ED2     0.043312  0.138556  0.312227  0.320582  0.247443 -0.136422   
H4PE7     1.000000  0.083155  0.019920  0.016236 -0.008393  0.007739   
H4GH1     0.083155  1.000000  0.054684  0.061387  0.052189 -0.159617   
H1ED14    0.019920  0.054684  1.000000  0.369217  0.274038 -0.045384   
H1ED13    0.016236  0.061387  0.369217  1.000000  0.265709 -0.092054   
H1ED12   -0.008393  0.052189  0.274038  0.265709  1.000000 -0.078425   
BMI_08    0.007739 -0.159617 -0.045384 -0.092054 -0.078425  1.000000   
H1GH51    0.038913  0.034059  0.011985 -0.001929  0.016559 -0.007152   
H1DA8    -0.007208 -0.055316 -0.072201 -0.067868 -0.070515  0.087938   
H1GH1     0.027795  0.150447  0.038801  0.079014  0.053747 -0.141892   
H1DA10   -0.027351 -0.005689 -0.033571 -0.038302 -0.005193  0.005908   
deltaBMI  0.014994 -0.122984 -0.019515 -0.057645 -0.038986  0.806830   

            H1GH51     H1DA8     H1GH1    H1DA10  deltaBMI  
H4OD1Y   -0.244623 -0.106669 -0.003649 -0.123981 -0.094812  
BIO_SEX  -0.040788 -0.056995 -0.054540 -0.266654  0.034812  
H1DA6     0.018209 -0.032492  0.077017 -0.000909 -0.005084  
H4DA1     0.000619  0.174549 -0.015919  0.076226  0.079991  
H4SP1H    0.004515  0.036632  0.005841  0.031966  0.002367  
H4ED2    -0.034505 -0.103852  0.103365 -0.044745 -0.110930  
H4PE7     0.038913 -0.007208  0.027795 -0.027351  0.014994  
H4GH1     0.034059 -0.055316  0.150447 -0.005689 -0.122984  
H1ED14    0.011985 -0.072201  0.038801 -0.033571 -0.019515  
H1ED13   -0.001929 -0.067868  0.079014 -0.038302 -0.057645  
H1ED12    0.016559 -0.070515  0.053747 -0.005193 -0.038986  
BMI_08   -0.007152  0.087938 -0.141892  0.005908  0.806830  
H1GH51    1.000000  0.080806  0.015237  0.058700  0.040620  
H1DA8     0.080806  1.000000 -0.057370  0.203143  0.061717  
H1GH1     0.015237 -0.057370  1.000000  0.020072 -0.043018  
H1DA10    0.058700  0.203143  0.020072  1.000000  0.006581  
deltaBMI  0.040620  0.061717 -0.043018  0.006581  1.000000  
Out[38]:
<AxesSubplot:>
In [39]:
df_study.dropna(inplace=True)
In [40]:
df_study.isna().sum()
Out[40]:
AID         0
H4OD1Y      0
BIO_SEX     0
H1DA6       0
H4DA1       0
H4SP1H      0
H4ED2       0
H4PE7       0
H4GH1       0
H1ED14      0
H1ED13      0
H1ED12      0
BMI_08      0
H1GH51      0
H1DA8       0
H1GH1       0
H1DA10      0
deltaBMI    0
dtype: int64
In [41]:
from sklearn.preprocessing import scale

# Normalize the Data
X = df_study.drop(['AID', 'H4ED2'], axis=1)   # drop the respondent identifier and the target
x = scale(X)
inputs = pd.DataFrame(x)
print(inputs.head())
inputs.columns = X.columns
target = pd.DataFrame(df_study[['H4ED2']])
target.reset_index(drop=True, inplace=True)  # reset index of target dataframe
data = pd.concat([inputs, target], axis=1).reindex(inputs.index)
print(data.head())
         0         1         2         3         4         5         6   \
0  1.906318 -1.068853  0.414104  3.405708  1.520813  0.628837  0.316181   
1 -1.006416 -1.068853 -2.414850  0.110109 -0.197299  0.628837  0.316181   
2 -1.006416 -1.068853 -2.414850 -0.622247  1.520813  0.628837  0.316181   
3  0.158678 -1.068853 -2.414850  1.062171 -0.770002  0.628837  0.316181   
4  0.158678 -1.068853  0.414104  0.549522  0.375405 -1.590237  0.316181   

         7         8         9         10        11        12        13  \
0 -0.703555 -0.754457 -0.618885 -0.331775 -1.320714  0.495267 -3.847738   
1 -0.703555 -0.754457 -0.618885 -0.356698  1.552580 -0.152470  0.259893   
2 -0.703555 -0.754457 -0.618885 -0.047244  1.552580  1.207777  0.259893   
3 -0.703555 -0.754457 -0.618885  1.999854 -0.602390 -0.411565  0.259893   
4 -0.703555  1.325457  1.615810 -0.112593  1.552580 -1.059301  0.259893   

         14        15  
0  1.268943 -0.787004  
1  1.268943 -0.247623  
2  1.268943  0.098083  
3  1.268943  0.240746  
4 -0.788057 -0.292085  
     H4OD1Y   BIO_SEX     H1DA6     H4DA1    H4SP1H     H4PE7     H4GH1  \
0  1.906318 -1.068853  0.414104  3.405708  1.520813  0.628837  0.316181   
1 -1.006416 -1.068853 -2.414850  0.110109 -0.197299  0.628837  0.316181   
2 -1.006416 -1.068853 -2.414850 -0.622247  1.520813  0.628837  0.316181   
3  0.158678 -1.068853 -2.414850  1.062171 -0.770002  0.628837  0.316181   
4  0.158678 -1.068853  0.414104  0.549522  0.375405 -1.590237  0.316181   

     H1ED14    H1ED13    H1ED12    BMI_08    H1GH51     H1DA8     H1GH1  \
0 -0.703555 -0.754457 -0.618885 -0.331775 -1.320714  0.495267 -3.847738   
1 -0.703555 -0.754457 -0.618885 -0.356698  1.552580 -0.152470  0.259893   
2 -0.703555 -0.754457 -0.618885 -0.047244  1.552580  1.207777  0.259893   
3 -0.703555 -0.754457 -0.618885  1.999854 -0.602390 -0.411565  0.259893   
4 -0.703555  1.325457  1.615810 -0.112593  1.552580 -1.059301  0.259893   

     H1DA10  deltaBMI  H4ED2  
0  1.268943 -0.787004      0  
1  1.268943 -0.247623      0  
2  1.268943  0.098083      0  
3  1.268943  0.240746      0  
4 -0.788057 -0.292085      0  
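The `scale()` call above returns a bare NumPy array, which is why the column names and index must be reattached by hand before concatenating the target back on. A sketch of an equivalent one-step approach with `StandardScaler` that preserves the DataFrame structure (`df_demo` is a stand-in, not the actual `df_study`):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# stand-in for two of the study's numeric columns, with a non-default index
df_demo = pd.DataFrame({'H1DA8': [2.0, 4.0, 6.0],
                        'BMI_08': [20.0, 25.0, 30.0]},
                       index=[10, 11, 12])

# fit_transform returns an array; rewrapping it with the original columns
# and index avoids the separate reset_index/concat/reindex steps
scaled = pd.DataFrame(StandardScaler().fit_transform(df_demo),
                      columns=df_demo.columns, index=df_demo.index)
print(scaled.round(3))
```

Keeping the original index also means the target column can be joined back without resetting indices first.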

Dealing with outliers

In [42]:
# Box plots of numerical independent variables against the target 'H4ED2'
numerical_features = data.select_dtypes(include=[np.number])

# Number of rows and columns in the numerical features DataFrame
num_rows, num_cols = numerical_features.shape

# Number of subplots required (excluding the target 'H4ED2')
num_plots = num_cols - 1

# Setting the size of the box plot grid
plt.figure(figsize=(12, 8))

# Loop through each numerical independent variable and create a box plot grouped by 'H4ED2'
for i, feature in enumerate(numerical_features.columns[:-1]):
    plt.subplot(4, 4, i + 1)
    sns.boxplot(data=data, x='H4ED2', y=feature, palette='coolwarm')
    plt.title(f"{feature} vs H4ED2")
    plt.xlabel("H4ED2 0: some college or less, 1: bachelor's or higher")
    plt.ylabel(feature)

plt.tight_layout()
plt.show()
In [43]:
# Inter quartile range

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
In [44]:
## Number of outliers

((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).sum()
Out[44]:
H4OD1Y        0
BIO_SEX       0
H1DA6       546
H4DA1       117
H4SP1H      124
H4PE7         0
H4GH1       339
H1ED14        0
H1ED13        0
H1ED12        0
BMI_08      110
H1GH51       21
H1DA8       207
H1GH1       236
H1DA10        0
deltaBMI    106
H4ED2         0
dtype: int64
In [45]:
## remove outliers from continuous features

Q1 = data[['BMI_08', 'H1GH51', 'H1DA8', 'deltaBMI']].quantile(0.25)
Q3 = data[['BMI_08', 'H1GH51', 'H1DA8', 'deltaBMI']].quantile(0.75)
IQR = Q3 - Q1

# Create masks for each column separately
mask_bmi = (data['BMI_08'] < (Q1['BMI_08'] - 1.5 * IQR['BMI_08'])) | (data['BMI_08'] > (Q3['BMI_08'] + 1.5 * IQR['BMI_08']))
mask_h1gh51 = (data['H1GH51'] < (Q1['H1GH51'] - 1.5 * IQR['H1GH51'])) | (data['H1GH51'] > (Q3['H1GH51'] + 1.5 * IQR['H1GH51']))
mask_h1da8 = (data['H1DA8'] < (Q1['H1DA8'] - 1.5 * IQR['H1DA8'])) | (data['H1DA8'] > (Q3['H1DA8'] + 1.5 * IQR['H1DA8']))
mask_deltabmi = (data['deltaBMI'] < (Q1['deltaBMI'] - 1.5 * IQR['deltaBMI'])) | (data['deltaBMI'] > (Q3['deltaBMI'] + 1.5 * IQR['deltaBMI']))

# Combine masks using logical OR (|) operator
mask_combined = mask_bmi | mask_h1gh51 | mask_h1da8 | mask_deltabmi

# # Replace outliers with NaN
# data[['BMI_08', 'H1GH51', 'H1DA8', 'deltaBMI']][mask_combined] = np.nan

# Replace outliers with NaN using .loc
data.loc[mask_combined, ['BMI_08', 'H1GH51', 'H1DA8', 'deltaBMI']] = np.nan
In [46]:
print(Q1,Q3,IQR)
BMI_08     -0.719252
H1GH51     -0.602390
H1DA8      -0.735433
deltaBMI   -0.686646
Name: 0.25, dtype: float64 BMI_08      0.519787
H1GH51      0.834257
H1DA8       0.300946
deltaBMI    0.529043
Name: 0.75, dtype: float64 BMI_08      1.239039
H1GH51      1.436647
H1DA8       1.036379
deltaBMI    1.215689
dtype: float64
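The four per-column masks above all apply the same 1.5*IQR fence and can be generalized into one function. A sketch under that assumption; the helper name `iqr_outlier_mask` is illustrative:

```python
import pandas as pd

def iqr_outlier_mask(df, cols, k=1.5):
    """Boolean Series marking rows where any of `cols` falls outside
    [Q1 - k*IQR, Q3 + k*IQR]. Generalizes the per-column masks above;
    illustrative helper, not part of the original notebook."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # any(axis=1): a row is an outlier if it breaks the fence in any column
    return ((df[cols] < lower) | (df[cols] > upper)).any(axis=1)

demo = pd.DataFrame({'a': [1, 2, 3, 100], 'b': [10, 11, 12, 13]})
print(iqr_outlier_mask(demo, ['a', 'b']).tolist())  # -> [False, False, False, True]
```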
In [47]:
data.isna().sum()
Out[47]:
H4OD1Y        0
BIO_SEX       0
H1DA6         0
H4DA1         0
H4SP1H        0
H4PE7         0
H4GH1         0
H1ED14        0
H1ED13        0
H1ED12        0
BMI_08      365
H1GH51      365
H1DA8       365
H1GH1         0
H1DA10        0
deltaBMI    365
H4ED2         0
dtype: int64
In [48]:
## remove outliers

data.dropna(inplace=True)
data.isna().sum()
Out[48]:
H4OD1Y      0
BIO_SEX     0
H1DA6       0
H4DA1       0
H4SP1H      0
H4PE7       0
H4GH1       0
H1ED14      0
H1ED13      0
H1ED12      0
BMI_08      0
H1GH51      0
H1DA8       0
H1GH1       0
H1DA10      0
deltaBMI    0
H4ED2       0
dtype: int64
In [49]:
print("-------Total Data Count------- \n\n",data.count())
print("\n\n\n-------Total Outlier Count - Pre Masking------- \n\n",mask_combined.sum())
print("\n\n\n-------Total Outlier Count - Post Masking------- \n\n",data.isna().sum())
-------Total Data Count------- 

 H4OD1Y      3365
BIO_SEX     3365
H1DA6       3365
H4DA1       3365
H4SP1H      3365
H4PE7       3365
H4GH1       3365
H1ED14      3365
H1ED13      3365
H1ED12      3365
BMI_08      3365
H1GH51      3365
H1DA8       3365
H1GH1       3365
H1DA10      3365
deltaBMI    3365
H4ED2       3365
dtype: int64



-------Total Outlier Count - Pre Masking------- 

 365



-------Total Outlier Count - Post Masking------- 

 H4OD1Y      0
BIO_SEX     0
H1DA6       0
H4DA1       0
H4SP1H      0
H4PE7       0
H4GH1       0
H1ED14      0
H1ED13      0
H1ED12      0
BMI_08      0
H1GH51      0
H1DA8       0
H1GH1       0
H1DA10      0
deltaBMI    0
H4ED2       0
dtype: int64
In [50]:
# Box plots of numerical independent variables against the target 'H4ED2'
numerical_features = data.select_dtypes(include=[np.number])

# Number of rows and columns in the numerical features DataFrame
num_rows, num_cols = numerical_features.shape

# Number of subplots required (excluding the target 'H4ED2')
num_plots = num_cols - 1

# Setting the size of the box plot grid
plt.figure(figsize=(12, 8))

# Loop through each numerical independent variable and create a box plot grouped by 'H4ED2'
for i, feature in enumerate(numerical_features.columns[:-1]):
    plt.subplot(4, 4, i + 1)
    sns.boxplot(data=data, x='H4ED2', y=feature, palette='coolwarm')
    plt.title(f"{feature} vs H4ED2")
    plt.xlabel("H4ED2 0: some college or less, 1: bachelor's or higher")
    plt.ylabel(feature)

plt.tight_layout()
plt.show()
In [51]:
data = data.copy()
corr = data.corr()
print(corr)

sns.heatmap(corr.round(2) , annot=True)
            H4OD1Y   BIO_SEX     H1DA6     H4DA1    H4SP1H     H4PE7  \
H4OD1Y    1.000000 -0.071604 -0.076924 -0.008349 -0.075554  0.011552   
BIO_SEX  -0.071604  1.000000  0.099721 -0.041191 -0.015109 -0.018298   
H1DA6    -0.076924  0.099721  1.000000 -0.014693  0.034795 -0.000568   
H4DA1    -0.008349 -0.041191 -0.014693  1.000000  0.082124 -0.041333   
H4SP1H   -0.075554 -0.015109  0.034795  0.082124  1.000000 -0.011746   
H4PE7     0.011552 -0.018298 -0.000568 -0.041333 -0.011746  1.000000   
H4GH1    -0.015481 -0.039312  0.001370 -0.072012 -0.048170  0.083410   
H1ED14   -0.084165  0.093272  0.049562 -0.048813 -0.022390  0.009257   
H1ED13   -0.027091  0.095036  0.038747 -0.064843 -0.014254  0.017724   
H1ED12   -0.085013  0.056531  0.021412 -0.050684 -0.007170 -0.016071   
BMI_08    0.073694 -0.022400 -0.018175  0.077688 -0.037999  0.019439   
H1GH51   -0.271268 -0.058856  0.035623  0.002293  0.020721  0.034611   
H1DA8    -0.104651 -0.069069 -0.030314  0.177356  0.031506 -0.018940   
H1GH1    -0.005206 -0.048461  0.074085  0.010174  0.019977  0.027295   
H1DA10   -0.131538 -0.267121 -0.003912  0.078597  0.019762 -0.034499   
deltaBMI -0.097031  0.013902 -0.012729  0.079551 -0.008747  0.032914   
H4ED2     0.008495  0.113626  0.046733 -0.097646  0.000122  0.042900   

             H4GH1    H1ED14    H1ED13    H1ED12    BMI_08    H1GH51  \
H4OD1Y   -0.015481 -0.084165 -0.027091 -0.085013  0.073694 -0.271268   
BIO_SEX  -0.039312  0.093272  0.095036  0.056531 -0.022400 -0.058856   
H1DA6     0.001370  0.049562  0.038747  0.021412 -0.018175  0.035623   
H4DA1    -0.072012 -0.048813 -0.064843 -0.050684  0.077688  0.002293   
H4SP1H   -0.048170 -0.022390 -0.014254 -0.007170 -0.037999  0.020721   
H4PE7     0.083410  0.009257  0.017724 -0.016071  0.019439  0.034611   
H4GH1     1.000000  0.053954  0.055180  0.050764 -0.113879  0.031414   
H1ED14    0.053954  1.000000  0.372483  0.268470 -0.066918  0.021051   
H1ED13    0.055180  0.372483  1.000000  0.271728 -0.077842 -0.000013   
H1ED12    0.050764  0.268470  0.271728  1.000000 -0.084570  0.029485   
BMI_08   -0.113879 -0.066918 -0.077842 -0.084570  1.000000 -0.034574   
H1GH51    0.031414  0.021051 -0.000013  0.029485 -0.034574  1.000000   
H1DA8    -0.022869 -0.078040 -0.073332 -0.049168  0.067118  0.099610   
H1GH1     0.145482  0.029136  0.082236  0.053391 -0.098517  0.036594   
H1DA10    0.000406 -0.037197 -0.039126 -0.000511  0.009189  0.080219   
deltaBMI -0.076425 -0.034993 -0.043276 -0.032875  0.767552  0.043519   
H4ED2     0.126026  0.312511  0.328126  0.245254 -0.127514 -0.023128   

             H1DA8     H1GH1    H1DA10  deltaBMI     H4ED2  
H4OD1Y   -0.104651 -0.005206 -0.131538 -0.097031  0.008495  
BIO_SEX  -0.069069 -0.048461 -0.267121  0.013902  0.113626  
H1DA6    -0.030314  0.074085 -0.003912 -0.012729  0.046733  
H4DA1     0.177356  0.010174  0.078597  0.079551 -0.097646  
H4SP1H    0.031506  0.019977  0.019762 -0.008747  0.000122  
H4PE7    -0.018940  0.027295 -0.034499  0.032914  0.042900  
H4GH1    -0.022869  0.145482  0.000406 -0.076425  0.126026  
H1ED14   -0.078040  0.029136 -0.037197 -0.034993  0.312511  
H1ED13   -0.073332  0.082236 -0.039126 -0.043276  0.328126  
H1ED12   -0.049168  0.053391 -0.000511 -0.032875  0.245254  
BMI_08    0.067118 -0.098517  0.009189  0.767552 -0.127514  
H1GH51    0.099610  0.036594  0.080219  0.043519 -0.023128  
H1DA8     1.000000 -0.031943  0.242220  0.069255 -0.098169  
H1GH1    -0.031943  1.000000  0.026874 -0.007508  0.089868  
H1DA10    0.242220  0.026874  1.000000  0.013669 -0.052048  
deltaBMI  0.069255 -0.007508  0.013669  1.000000 -0.099996  
H4ED2    -0.098169  0.089868 -0.052048 -0.099996  1.000000  
Out[51]:
<AxesSubplot:>
In [52]:
df_depr=data.copy() ## data is scaled

# df_depr=df_study.copy() ## data is unscaled


df_depr.isna().sum()
df_depr.dropna(inplace=True)
df_depr.isna().sum()
Out[52]:
H4OD1Y      0
BIO_SEX     0
H1DA6       0
H4DA1       0
H4SP1H      0
H4PE7       0
H4GH1       0
H1ED14      0
H1ED13      0
H1ED12      0
BMI_08      0
H1GH51      0
H1DA8       0
H1GH1       0
H1DA10      0
deltaBMI    0
H4ED2       0
dtype: int64
In [53]:
df_depr.describe()
Out[53]:
H4OD1Y BIO_SEX H1DA6 H4DA1 H4SP1H H4PE7 H4GH1 H1ED14 H1ED13 H1ED12 BMI_08 H1GH51 H1DA8 H1GH1 H1DA10 deltaBMI H4ED2
count 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000 3365.000000
mean 0.013604 -0.006770 -0.000361 -0.030356 -0.007022 0.006309 0.037040 0.006852 0.018789 0.017987 -0.128448 -0.009800 -0.178264 0.035286 -0.017217 -0.108114 0.364636
std 0.997907 1.000577 1.000509 0.970099 0.979417 0.997091 0.945228 1.002582 1.005324 1.008915 0.819392 0.944172 0.674064 0.934032 0.995851 0.835905 0.481399
min -1.588963 -1.068853 -2.414850 -0.988425 -3.060817 -1.590237 -3.162744 -0.703555 -0.754457 -0.618885 -1.951934 -2.757361 -1.059301 -3.847738 -0.788057 -2.501573 0.000000
25% -1.006416 -1.068853 0.414104 -0.622247 -0.770002 -1.590237 0.316181 -0.703555 -0.754457 -0.618885 -0.741173 -0.602390 -0.735433 0.259893 -0.788057 -0.710764 0.000000
50% 0.158678 0.935582 0.414104 -0.256069 -0.197299 0.628837 0.316181 -0.703555 -0.754457 -0.618885 -0.263333 0.115933 -0.411565 0.259893 -0.788057 -0.193374 0.000000
75% 0.741224 0.935582 0.414104 0.476286 0.375405 0.628837 0.316181 1.421352 1.325457 1.615810 0.380330 0.834257 0.236172 0.259893 1.268943 0.411924 1.000000
max 3.071411 0.935582 0.414104 9.264552 3.238924 0.628837 0.316181 1.421352 1.325457 1.615810 2.365876 2.989227 1.661193 0.259893 1.268943 2.346618 1.000000
In [54]:
num_feats = df_depr.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(5, 4, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()

Feature engineering

Carrying out feature engineering to identify the top 3 features.

In [55]:
# Train each of the Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

The target variable is unevenly distributed across the two classes.

Undersampling and upsampling the dataset

In [56]:
df_depr.shape
Out[56]:
(3365, 17)
In [57]:
print("Number of participants with education level some college or less (H4ED2 == 0):", (df_depr['H4ED2']==0).sum())
print("Number of participants with education level bachelor's degree or higher (H4ED2 == 1):", (df_depr['H4ED2']==1).sum())
Number of participants with education level some college or less (H4ED2 == 0): 2138
Number of participants with education level bachelor's degree or higher (H4ED2 == 1): 1227
In [58]:
# https://elitedatascience.com/imbalanced-classes
from sklearn.utils import resample

# Separate majority and minority classes
Depress_no = df_depr[df_depr['H4ED2']==0]
Depress_yes = df_depr[df_depr['H4ED2']==1]


rand_state = 123

# Downsample the majority class and upsample the minority class to 1500 rows each
Depress_no_upsampled = resample(Depress_no, 
                                 replace=False,     # sample without replacement (downsample)
                                 n_samples=1500,    # target class size
                                 random_state=rand_state) # reproducible results

Depress_yes_upsampled = resample(Depress_yes, 
                                 replace=False,     # sample without replacement
                                 n_samples=(df_depr['H4ED2']==1).sum(),    # keep every minority row once
                                 random_state=rand_state) # reproducible results   
    
Depress_yes_upsampled_repeat = resample(Depress_yes, 
                                 replace=True,     # sample with replacement
                                 n_samples=1500 - (df_depr['H4ED2']==1).sum(),    # top up the minority class to 1500
                                 random_state=rand_state) # reproducible results   


# Combine the downsampled majority class with the upsampled minority class
df_depr_upsampled = pd.concat([Depress_no_upsampled, Depress_yes_upsampled, Depress_yes_upsampled_repeat])
 
# Display new class counts
df_depr_upsampled['H4ED2'].value_counts()
Out[58]:
0    1500
1    1500
Name: H4ED2, dtype: int64
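The three `resample()` calls above downsample the majority class and top up the minority class to 1500 rows each. The same logic can be sketched as a single loop over classes; the helper name `balance_to_n` is illustrative, not part of the original notebook:

```python
import pandas as pd
from sklearn.utils import resample

def balance_to_n(df, target, n, random_state=123):
    """Resample every class of `target` to exactly n rows: classes larger
    than n are downsampled without replacement, smaller classes are
    upsampled with replacement. Compact equivalent of the three
    resample() calls above; illustrative helper."""
    parts = []
    for _, grp in df.groupby(target):
        parts.append(resample(grp,
                              replace=len(grp) < n,  # only upsample if too small
                              n_samples=n,
                              random_state=random_state))
    return pd.concat(parts)

# toy frame: 7 majority rows (y=0), 3 minority rows (y=1)
demo = pd.DataFrame({'x': range(10), 'y': [0]*7 + [1]*3})
balanced = balance_to_n(demo, 'y', n=5)
print(balanced['y'].value_counts().to_dict())  # both classes now have 5 rows
```

One design note: upsampling with replacement duplicates minority rows, so it should be done before (or only on) the training split to avoid leaking copies of the same respondent into the test set.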
In [59]:
data = df_depr_upsampled.copy()
corr = data.corr()
# print(corr)

sns.heatmap(corr.round(2) , annot=True)
Out[59]:
<AxesSubplot:>
In [60]:
num_feats = df_depr_upsampled.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
for i, feature in enumerate(num_feats.columns):
    plt.subplot(5, 4, i+1)
    sns.histplot(data=num_feats, x=feature, kde=True)
    plt.title(f"{feature} Distribution")
plt.tight_layout()
plt.show()
In [61]:
df_depr_upsampled.isna().sum()
Out[61]:
H4OD1Y      0
BIO_SEX     0
H1DA6       0
H4DA1       0
H4SP1H      0
H4PE7       0
H4GH1       0
H1ED14      0
H1ED13      0
H1ED12      0
BMI_08      0
H1GH51      0
H1DA8       0
H1GH1       0
H1DA10      0
deltaBMI    0
H4ED2       0
dtype: int64
In [62]:
# print(df_depr_upsampled.dtypes)

# df_depr_upsampled['H4ED2'] = df_depr_upsampled['H4ED2'].astype(int)
# print(df_depr_upsampled.dtypes)

Feature engineering:

Feature engineering trial 1

In [63]:
X = df_depr_upsampled.drop(columns = ['H4ED2'])
y = df_depr_upsampled['H4ED2']


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
In [64]:
from sklearn.feature_selection import mutual_info_classif
import numpy as np

# Assuming you have X and y data loaded

# Calculate the mutual information scores
mi_scores = mutual_info_classif(X, y)

# Create a dictionary of feature importance scores
feature_scores = {}
for i, feature in enumerate(X.columns):
    feature_scores[feature] = mi_scores[i]

# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Score:', score)
    
    
# ________________________________mutual_info_classif____________________________________________________________________    
    
# Plot a horizontal bar chart of the feature importance scores
fig, ax = plt.subplots()
y_pos = np.arange(len(sorted_features))
ax.barh(y_pos, [score for feature, score in sorted_features], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels([feature for feature, score in sorted_features])
ax.invert_yaxis()  # Labels read top-to-bottom
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance Scores (Information Gain)")

# Add importance scores as labels on the horizontal bar chart
for i, v in enumerate([score for feature, score in sorted_features]):
    ax.text(v + 0.01, i, str(round(v, 3)), color="black", fontweight="bold")
plt.show()
Feature: H1ED13 Score: 0.05724579284312736
Feature: BMI_08 Score: 0.037477186558100284
Feature: H1ED14 Score: 0.03655662411921301
Feature: deltaBMI Score: 0.0330618536267846
Feature: H4SP1H Score: 0.027262553450683846
Feature: H1ED12 Score: 0.021749605837021235
Feature: H4OD1Y Score: 0.018331205911926274
Feature: H1GH51 Score: 0.015467362979546095
Feature: H1DA8 Score: 0.013134548461974482
Feature: H4PE7 Score: 0.011403419243543711
Feature: H1GH1 Score: 0.008710577519277951
Feature: H4GH1 Score: 0.0039630079266981255
Feature: BIO_SEX Score: 0.0
Feature: H1DA6 Score: 0.0
Feature: H4DA1 Score: 0.0
Feature: H1DA10 Score: 0.0
In [65]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_regression


# Apply Information Gain
ig = mutual_info_regression(X, y)

# Create a dictionary of feature importance scores
feature_scores = {}
for i in range(len(X.columns)):
    feature_scores[X.columns[i]] = ig[i]
    

# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the feature importance scores and the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Score:', score)
    
    
# ________________________________________mutual_info_regression___________________________________________________________    
    
# Plot a horizontal bar chart of the feature importance scores
fig, ax = plt.subplots()
y_pos = np.arange(len(sorted_features))
ax.barh(y_pos, [score for feature, score in sorted_features], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels([feature for feature, score in sorted_features])
ax.invert_yaxis()  # Labels read top-to-bottom
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance Scores (Information Gain)")

# Add importance scores as labels on the horizontal bar chart
for i, v in enumerate([score for feature, score in sorted_features]):
    ax.text(v + 0.01, i, str(round(v, 3)), color="black", fontweight="bold")
plt.show()
Feature: H1ED14 Score: 0.05342020713142137
Feature: H4GH1 Score: 0.042461356126936245
Feature: BMI_08 Score: 0.0375331570939883
Feature: H1ED13 Score: 0.03390560349124083
Feature: H4DA1 Score: 0.033409037154273236
Feature: deltaBMI Score: 0.03319838448486667
Feature: H1ED12 Score: 0.01796508019597365
Feature: H1GH1 Score: 0.013616690418756683
Feature: H4OD1Y Score: 0.013511764727283548
Feature: H1DA8 Score: 0.01164402340109394
Feature: H1DA6 Score: 0.011463257759896095
Feature: BIO_SEX Score: 0.009296292685357699
Feature: H1GH51 Score: 0.006478845066328454
Feature: H4SP1H Score: 0.0
Feature: H4PE7 Score: 0.0
Feature: H1DA10 Score: 0.0
In [66]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression  # Replace with your desired model

# Assuming you have X and y data loaded

# Create an instance of the model you want to use for feature selection
model = LinearRegression()  # Replace with your desired model

# Create the RFECV object with your model and scoring metric
rfecv = RFECV(estimator=model, scoring='neg_mean_squared_error', cv=10)

# Fit the RFECV to your data
rfecv.fit(X, y)

# Record whether each feature was selected (rfecv.support_ is boolean)
feature_scores = {}
for i, feature in enumerate(X.columns):
    feature_scores[feature] = rfecv.support_[i]

# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Selected:', score)
Feature: H4OD1Y Selected: True
Feature: BIO_SEX Selected: True
Feature: H1DA6 Selected: True
Feature: H4DA1 Selected: True
Feature: H4SP1H Selected: True
Feature: H4PE7 Selected: True
Feature: H4GH1 Selected: True
Feature: H1ED14 Selected: True
Feature: H1ED13 Selected: True
Feature: H1ED12 Selected: True
Feature: BMI_08 Selected: True
Feature: H1GH51 Selected: True
Feature: H1DA8 Selected: True
Feature: H1GH1 Selected: True
Feature: deltaBMI Selected: True
Feature: H1DA10 Selected: False
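RFECV exposes more than the boolean `support_` mask printed above: `ranking_` gives the elimination order (1 = selected) and `n_features_` the chosen feature count. A sketch on synthetic data, assuming nothing about the Add Health variables (`X_demo` and `y_demo` are stand-ins):

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# synthetic regression: only columns 0 and 2 carry signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 2] + rng.normal(scale=0.1, size=200)

rfecv = RFECV(estimator=LinearRegression(),
              scoring='neg_mean_squared_error', cv=5)
rfecv.fit(X_demo, y_demo)

print(rfecv.n_features_)  # how many features survived elimination
print(rfecv.ranking_)     # 1 = kept; larger values were eliminated earlier
```

Inspecting `ranking_` alongside `support_` shows which discarded features were near-misses, which is useful context for the trial-2 feature set below.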

Feature engineering trial 2

In [67]:
df_depr_upsampled.columns
Out[67]:
Index(['H4OD1Y', 'BIO_SEX', 'H1DA6', 'H4DA1', 'H4SP1H', 'H4PE7', 'H4GH1',
       'H1ED14', 'H1ED13', 'H1ED12', 'BMI_08', 'H1GH51', 'H1DA8', 'H1GH1',
       'H1DA10', 'deltaBMI', 'H4ED2'],
      dtype='object')
In [68]:
X = df_depr_upsampled.drop(columns = ['H4ED2','BMI_08','H1GH51'])
y = df_depr_upsampled['H4ED2']


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=20)
In [69]:
from sklearn.feature_selection import mutual_info_classif
import numpy as np

# Calculate the mutual information scores
mi_scores = mutual_info_classif(X, y)

# Create a dictionary of feature importance scores
feature_scores = {}
for i, feature in enumerate(X.columns):
    feature_scores[feature] = mi_scores[i]

# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Score:', score)
    
    
# ________________________________mutual_info_classif____________________________________________________________________    
    
# Plot a horizontal bar chart of the feature importance scores
fig, ax = plt.subplots()
y_pos = np.arange(len(sorted_features))
ax.barh(y_pos, [score for feature, score in sorted_features], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels([feature for feature, score in sorted_features])
ax.invert_yaxis()  # Labels read top-to-bottom
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance Scores (Information Gain)")

# Add importance scores as labels on the horizontal bar chart
for i, v in enumerate([score for feature, score in sorted_features]):
    ax.text(v + 0.01, i, str(round(v, 3)), color="black", fontweight="bold")
plt.show()
Feature: H1ED13 Score: 0.04869344179523227
Feature: H1ED14 Score: 0.04758499246512016
Feature: deltaBMI Score: 0.0330618536267846
Feature: H1ED12 Score: 0.018827366676939183
Feature: H4SP1H Score: 0.012808214341623492
Feature: H1DA6 Score: 0.012235103894387311
Feature: H1DA10 Score: 0.010642426001319372
Feature: H4PE7 Score: 0.008474994801342284
Feature: H1DA8 Score: 0.007227317995577076
Feature: H4DA1 Score: 0.006271560627939543
Feature: H4OD1Y Score: 0.0058153704803010076
Feature: BIO_SEX Score: 0.004886778184797214
Feature: H4GH1 Score: 0.0
Feature: H1GH1 Score: 0.0
In [70]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_regression


# Apply Information Gain
ig = mutual_info_regression(X, y)

# Create a dictionary of feature importance scores
feature_scores = {}
for i in range(len(X.columns)):
    feature_scores[X.columns[i]] = ig[i]
    

# Sort the features by importance score in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the feature importance scores and the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Score:', score)
    
    
# ________________________________________mutual_info_regression___________________________________________________________    
    
# Plot a horizontal bar chart of the feature importance scores
fig, ax = plt.subplots()
y_pos = np.arange(len(sorted_features))
ax.barh(y_pos, [score for feature, score in sorted_features], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels([feature for feature, score in sorted_features])
ax.invert_yaxis()  # Labels read top-to-bottom
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance Scores (Information Gain)")

# Add importance scores as labels on the horizontal bar chart
for i, v in enumerate([score for feature, score in sorted_features]):
    ax.text(v + 0.01, i, str(round(v, 3)), color="black", fontweight="bold")
plt.show()
Feature: H1ED13 Score: 0.05646566715242329
Feature: H1ED14 Score: 0.05253253637207145
Feature: deltaBMI Score: 0.03310553688267781
Feature: H1ED12 Score: 0.02807843045191749
Feature: H1GH1 Score: 0.02554936276728892
Feature: H4GH1 Score: 0.02554887821342522
Feature: H4SP1H Score: 0.02000950984118255
Feature: H1DA8 Score: 0.01835607195894262
Feature: H4OD1Y Score: 0.015455449349830452
Feature: H4PE7 Score: 0.00730664932938474
Feature: BIO_SEX Score: 0.0
Feature: H1DA6 Score: 0.0
Feature: H4DA1 Score: 0.0
Feature: H1DA10 Score: 0.0
In [71]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Create an instance of the estimator to use for feature selection
model = LinearRegression()

# Create the RFECV object with the model and scoring metric
rfecv = RFECV(estimator=model, scoring='neg_mean_squared_error', cv=10)

# Fit the RFECV to the data
rfecv.fit(X, y)

# Record whether each feature was selected (support_ holds booleans)
feature_scores = {}
for i, feature in enumerate(X.columns):
    feature_scores[feature] = rfecv.support_[i]

# Sort so that selected features appear first
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the sorted features
for feature, score in sorted_features:
    print('Feature:', feature, 'Selected:', score)
Feature: H4OD1Y Selected: True
Feature: BIO_SEX Selected: True
Feature: H4DA1 Selected: True
Feature: H4PE7 Selected: True
Feature: H4GH1 Selected: True
Feature: H1ED14 Selected: True
Feature: H1ED13 Selected: True
Feature: H1ED12 Selected: True
Feature: H1DA8 Selected: True
Feature: H1GH1 Selected: True
Feature: deltaBMI Selected: True
Feature: H1DA6 Selected: False
Feature: H4SP1H Selected: False
Feature: H1DA10 Selected: False

Features with the lowest scores were removed.

In [72]:
## Building 3 supervised models

# dt = DecisionTreeClassifier()
# dt.fit(X, y)
# rf = RandomForestClassifier()
# rf.fit(X, y)
# lr = LogisticRegression()
# lr.fit(X, y)
In [73]:
# importance scores of features using decision tree

# importance = dt.feature_importances_
# for i,v in enumerate(importance):
#     print('Feature: %0d, Score: %.5f' % (i,v))
In [74]:
# importance scores of features using random forest

# importance = rf.feature_importances_
# for i,v in enumerate(importance):
#     print('Feature: %0d, Score: %.5f' % (i,v))
In [75]:
# importance scores of features using logistic regression

# importance = lr.coef_[0]
# for i,v in enumerate(importance):
#     print('Feature: %0d, Score: %.5f' % (i,v))

Feature engineering observation:

Based on the mutual information scores above, the top 3 features are:

H1ED13

H1ED14

deltaBMI

Ensemble modelling

Build an ensemble model using the following 4 models

  1. KNeighbors
  2. RandomForest
  3. LogisticRegression
  4. Gaussian Naive Bayes

Model 1 : KNeighbors

In [76]:
# Model 1 : KNeighbors

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
params_knn = {'n_neighbors': np.arange(1, 25)}
knn_gs = GridSearchCV(knn, params_knn, cv=10)
knn_gs.fit(X_train, y_train)


knn_best = knn_gs.best_estimator_
print(knn_gs.best_params_)
{'n_neighbors': 22}
In [77]:
# import metrics class
from sklearn import metrics
In [78]:
y_pred_proba = knn_gs.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)

auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,3)))
plt.legend(loc=4)
plt.show()
In [79]:
y_pred = knn_gs.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[79]:
array([[258, 121],
       [128, 243]], dtype=int64)
In [80]:
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
Sensitivity: 0.6549865229110512
Precision: 0.6675824175824175
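Note that scikit-learn's precision_score measures precision (TP / (TP + FP)), not specificity (TN / (TN + FP)); specificity has to be derived from the confusion matrix directly. A quick sketch using the KNN confusion matrix printed above:

```python
import numpy as np

# Confusion matrix from the KNN model above (rows: actual, cols: predicted)
cm = np.array([[258, 121],
               [128, 243]])

tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)   # recall, a.k.a. true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # what precision_score actually reports

print("Sensitivity:", round(sensitivity, 4))
print("Specificity:", round(specificity, 4))
print("Precision:", round(precision, 4))
```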

Model 2 : RandomForest

In [81]:
# Model 2 : RandomForest

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
params_rf = {'n_estimators': [50, 100, 200]}
rf_gs = GridSearchCV(rf, params_rf, cv=10)
rf_gs.fit(X_train, y_train)

#Store the Second Model’s Results

rf_best = rf_gs.best_estimator_
print(rf_gs.best_params_)
{'n_estimators': 200}
In [82]:
y_pred_proba = rf_gs.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,3)))
plt.legend(loc=4)
plt.show()
In [83]:
y_pred = rf_gs.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[83]:
array([[267, 112],
       [103, 268]], dtype=int64)
In [84]:
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
Sensitivity: 0.7223719676549866
Precision: 0.7052631578947368

Model 3: Logistic regression

In [85]:
# Model 3: Logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[85]:
LogisticRegression()
In [86]:
# logreg = LogisticRegression()
# logModel = logreg.fit(X_train,y_train)
y_pred_proba = logreg.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,3)))
plt.legend(loc=4)
plt.show()
In [87]:
y_pred = logreg.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[87]:
array([[272, 107],
       [134, 237]], dtype=int64)
In [88]:
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
Sensitivity: 0.6388140161725068
Precision: 0.688953488372093

Model 4: Gaussian Naive Bayes

In [89]:
# GaussianNB
from sklearn.naive_bayes import GaussianNB
model_gb = GaussianNB()
model_gb.fit(X_train, y_train)
Out[89]:
GaussianNB()
In [90]:
y_pred_proba = model_gb.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,3)))
plt.legend(loc=4)
plt.show()
In [91]:
y_pred = model_gb.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[91]:
array([[220, 159],
       [ 81, 290]], dtype=int64)
In [92]:
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
Sensitivity: 0.7816711590296496
Precision: 0.6458797327394209

checking scores

In [93]:
print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(logreg.score(X_test, y_test)))
print('gnb: {}'.format(model_gb.score(X_test, y_test)))
knn: 0.668
rf: 0.7133333333333334
log_reg: 0.6786666666666666
gnb: 0.68

Ensemble

Ensemble model of all 4 models

In [94]:
# ensemble all models
from sklearn.ensemble import VotingClassifier
estimators=[('knn', knn_best), ('rf', rf_best), ('log_reg', logreg), ('gnb',model_gb)]
ensemble = VotingClassifier(estimators, voting='soft')   # soft voting averages predicted probabilities
ensemble.fit(X_train, y_train)
Out[94]:
VotingClassifier(estimators=[('knn', KNeighborsClassifier(n_neighbors=22)),
                             ('rf', RandomForestClassifier(n_estimators=200)),
                             ('log_reg', LogisticRegression()),
                             ('gnb', GaussianNB())],
                 voting='soft')
In [95]:
y_pred_proba = ensemble.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="Predict Algorithm, auc="+str(round(auc,5)))
plt.legend(loc=4)
plt.show()
In [96]:
y_pred = ensemble.predict(X_test)
# import metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[96]:
array([[253, 126],
       [105, 266]], dtype=int64)
In [97]:
print("Sensitivity:", metrics.recall_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
Sensitivity: 0.7169811320754716
Precision: 0.6785714285714286
In [98]:
import matplotlib.pyplot as plt

# Compare the class balance of y_pred, y_test, y_train, and y_pred_train (all 0/1 arrays)

y_pred_train = ensemble.predict(X_train)


# Count the frequency of 0s and 1s in each array
unique_pred, counts_pred = np.unique(y_pred, return_counts=True)
unique_test, counts_test = np.unique(y_test, return_counts=True)
unique_train, counts_train = np.unique(y_train, return_counts=True)
unique_pred_train, counts_pred_train = np.unique(y_pred_train, return_counts=True)

# Create a grouped bar plot
width = 0.2  # Width of the bars
x = np.arange(len(unique_pred))

fig, ax = plt.subplots()
rects1 = ax.bar(x - 1.5*width, counts_pred, width, label='y_pred')
rects2 = ax.bar(x - 0.5*width, counts_test, width, label='y_test')
rects3 = ax.bar(x + 0.5*width, counts_train, width, label='y_train')
rects4 = ax.bar(x + 1.5*width, counts_pred_train, width, label='y_pred_train')

ax.set_xlabel('Predicted Value')
ax.set_ylabel('Frequency')
ax.set_title('Frequency of 0s and 1s in y_pred, y_test, y_train, and y_pred_train')
ax.set_xticks(x)
ax.set_xticklabels(unique_pred)
ax.legend()

plt.show()

Hyperparameter tuning (Out of scope)

This step was skipped because the code takes a long time to execute.

Hyperparameter tuning is an essential step in optimizing the performance of a classification model.

In [99]:
# from sklearn.ensemble import RandomForestClassifier

# # Create the model
# model = RandomForestClassifier()


# # Define the Hyperparameter Grid:
# # Create a dictionary of hyperparameters and their corresponding values that you want to tune. 
# # Choose a range of values to search for each hyperparameter.



# param_grid = {
#     'n_estimators': [100, 200, 300],          # Number of trees in the forest
#     'max_depth': [None, 5, 10, 20],          # Maximum depth of the tree
#     'min_samples_split': [2, 5, 10],         # Minimum number of samples required to split an internal node
#     'min_samples_leaf': [1, 2, 4]            # Minimum number of samples required to be at a leaf node
# }
In [100]:
# Perform Grid Search or Randomized Search:

# from sklearn.model_selection import GridSearchCV

# # Create the GridSearchCV object
# grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=5)

# # Fit the GridSearchCV to the training data
# grid_search.fit(X_train, y_train)

# # Get the best hyperparameters
# best_params = grid_search.best_params_
# print("Best Hyperparameters:", best_params)
In [101]:
# # Train the model with the best hyperparameters
# best_model = RandomForestClassifier(**best_params)
# best_model.fit(X_train, y_train)

# # Predict on the test set
# y_pred = best_model.predict(X_test)

# # Evaluate the model
# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# accuracy = accuracy_score(y_test, y_pred)
# conf_matrix = confusion_matrix(y_test, y_pred)
# classification_rep = classification_report(y_test, y_pred)

# print("Accuracy:", accuracy)
# print("Confusion Matrix:\n", conf_matrix)
# print("Classification Report:\n", classification_rep)

Conclusion

17 features were shortlisted from the dataset to investigate whether participants would continue studying to the "graduate level or more" or stop at the college level.

Features BMI_08 and BMI_94 were calculated from participants' height and weight information. deltaBMI is the difference between BMI_08 and BMI_94, indicating each participant's overall BMI / health trend.
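The BMI / deltaBMI derivation can be sketched as follows; the height/weight column names below are hypothetical placeholders, not the actual Add Health variable names:

```python
import pandas as pd

# Illustrative data with hypothetical column names (heights in m, weights in kg)
df = pd.DataFrame({
    'height_m_94': [1.65, 1.80], 'weight_kg_94': [55, 70],
    'height_m_08': [1.67, 1.82], 'weight_kg_08': [68, 95],
})

# BMI = weight (kg) / height (m)^2
df['BMI_94'] = df['weight_kg_94'] / df['height_m_94'] ** 2
df['BMI_08'] = df['weight_kg_08'] / df['height_m_08'] ** 2

# deltaBMI: change in BMI between Wave I (1994-95) and Wave IV (2008)
df['deltaBMI'] = df['BMI_08'] - df['BMI_94']
print(df[['BMI_94', 'BMI_08', 'deltaBMI']].round(2))
```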

The interquartile range (IQR) method was used to remove outliers from numerical features: the Q1 and Q3 quartiles were calculated to estimate the IQR. Because features had non-uniform units (kg, lbs, hrs, yes/no, etc.), the dataset was scaled to remove the unwanted influence of feature units.
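The IQR rule can be sketched on a single numeric column as below; the column name and the 1.5 multiplier are the conventional illustration, not necessarily the exact settings used in this report:

```python
import pandas as pd

def remove_outliers_iqr(df, col):
    # Rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

df = pd.DataFrame({'BMI_08': [18, 21, 22, 23, 24, 26, 80]})
print(remove_outliers_iqr(df, 'BMI_08')['BMI_08'].tolist())
# -> [18, 21, 22, 23, 24, 26]   (the extreme value 80 is dropped)
```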

According to the feature-scoring algorithms, the top three features with mutual_info_classif were H1ED13 (score 0.057), BMI_08 (0.037), and H1ED14 (0.036); with mutual_info_regression they were H1ED14 (0.053), H4GH1 (0.042), and BMI_08 (0.037). In addition, the Recursive Feature Elimination with Cross-Validation (RFECV) feature selection method was tried in the feature engineering process. Based on the feature scores, two features were removed and the remaining 15 features were used to train the classification ML models.

An ensemble model (Sensitivity 0.71, Precision 0.68, AUC 0.77) was generated from the following 4 models: KNeighbors (score 0.668), RandomForest (score 0.7133), LogisticRegression (score 0.678), and Gaussian Naive Bayes (score 0.68).

Future Scope

Hyperparameter tuning is an essential step in optimizing the performance of a classification model. It is a time-intensive process, and due to time constraints this step was not implemented in this report. In the future, hyperparameter tuning could be applied to improve model performance.
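If an exhaustive grid search remains too slow, randomized search is one cheaper option: it samples a fixed number of hyperparameter combinations (n_iter) from the grid instead of trying all of them. A minimal sketch, using a synthetic dataset (make_classification) purely for illustration rather than the report's data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; replace with X_train, y_train in practice
X_demo, y_demo = make_classification(n_samples=300, random_state=0)

# Same illustrative ranges as the commented-out grid above
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Try only 8 sampled combinations instead of all 108
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=8, cv=3, random_state=0)
search.fit(X_demo, y_demo)
print("Best hyperparameters:", search.best_params_)
```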

References:

  1. Harris, Kathleen Mullan; Udry, Richard J., 2015, "National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave I, 1994-1995", https://doi.org/10.15139/S3/11900, UNC Dataverse, V3 https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/11900

  2. Harris, Kathleen Mullan; Udry, Richard J., 2015, "National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave IV, 2008", https://doi.org/10.15139/S3/11920, UNC Dataverse, V3. https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/11920

  3. Lecy, N., Osteen, P. The Effects of Childhood Trauma on College Completion. Res High Educ 63, 1058–1072 (2022). https://doi.org/10.1007/s11162-022-09677-9

  4. The National Longitudinal Study of Adolescent to Adult Health (Add Health) https://addhealth.cpc.unc.edu/

  5. ODUM INSTITUTE DATA ARCHIVE https://odum.unc.edu/archive/

  6. Heatmaps in Python : https://plotly.com/python/heatmaps/

  7. Setting the Font, Title, Legend Entries, and Axis Titles in Python: https://plotly.com/python/figure-labels/

  8. EDA: https://github.com/SaurabhPrabhu94/ANLY-530-Group-Project-Heart/tree/main

  9. Binary Classification https://www.learndatasci.com/glossary/binary-classification/#:~:text=Now%2C%20for%20the%20targets%3A%20dataset%20%5B%27target%27%5D.head%20%28%29%200,1%20357%200%20212%20Name%3A%20target%2C%20dtype%3A%20int64

  10. Brown, D. W., Anda, R. F., Tiemeier, H., Felitti, V. J., Edwards, V. J., Croft, J. B., & Giles, W. H. (2009). Adverse childhood experiences and the risk of premature mortality. American Journal of Preventive Medicine, 37, 389–396. https://doi.org/10.1016/j.amepre.2009.06.021

Lecture notes:

01 Data Preprocessing I - Intro to Jupyter, Pandas Dataframe objects, and Handling Missing

02 Data Preprocessing II - DF Mechanics - Joining, Merging, and Concatenation

03 Visual Data Exploration

04 Frequent Pattern Mining and Unsupervised Machine Learning

05 Frequent Pattern Mining II

06 Generalized Linear Models, Supervised Machine Learning Intro and Classification